INN Hotels Project

Student - Ashley Graham

Context

A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons for cancellations include changes of plans and scheduling conflicts. Cancelling is often made easier by the option to do so free of charge or at a low cost, which benefits hotel guests but is a less desirable and possibly revenue-diminishing factor for hotels. Such losses are particularly high for last-minute cancellations.

The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.

The cancellation of bookings impacts a hotel on various fronts:

Objective

The increasing number of cancellations calls for a Machine Learning based solution that can help predict which bookings are likely to be canceled. INN Hotels Group has a chain of hotels in Portugal. They are facing problems with the high number of booking cancellations and have reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict in advance which bookings are going to be canceled, and help formulate profitable policies for cancellations and refunds.

Data Description

The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.

Data Dictionary

Importing necessary libraries and data

Data Overview

Load the data

View the first and last 5 rows of the dataset.

Understand the shape of the dataset.

Check the data types of the columns for the dataset.
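The overview steps above can be sketched as follows. The small inline DataFrame is a hypothetical stand-in for the real booking file, whose path is not given here; the column names are assumptions based on the variables discussed later in this notebook, and the values are made up.

```python
import pandas as pd

# Hypothetical stand-in for the booking data (made-up values).
# In the notebook this frame would come from pd.read_csv on the provided file.
data = pd.DataFrame(
    {
        "no_of_adults": [2, 2, 1, 2, 2, 1],
        "lead_time": [224, 5, 1, 211, 48, 346],
        "avg_price_per_room": [65.0, 106.0, 60.0, 100.0, 94.5, 115.0],
        "market_segment_type": ["Offline", "Online", "Online",
                                "Online", "Online", "Corporate"],
        "booking_status": ["Not_Canceled", "Not_Canceled", "Canceled",
                           "Canceled", "Canceled", "Not_Canceled"],
    }
)

print(data.head())  # first 5 rows
print(data.tail())  # last 5 rows

n_rows, n_cols = data.shape  # (rows, columns)
print(f"{n_rows} rows, {n_cols} columns")

print(data.dtypes)  # one dtype per column
```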

Observation

Observations

For the remaining 33% of the bookings that are cancelled, it will be useful to analyze the segments that have the most cancellations.

Overall Observations

No_of_previous_cancellations and No_of_previous_bookings_not_canceled appear to carry similar information.

Summary of the dataset

Observation

Exploratory Data Analysis (EDA)

Questions:

  1. What are the busiest months in the hotel?
  2. Which market segment do most of the guests come from?
  3. Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
  4. What percentage of bookings are canceled?
  5. Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
  6. Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?

Univariate Analysis

No_of_Adults

Observations

No_of_children

Observations

No_of_weekend_nights

Observations

No_of_weeknights

Observations

required_parking_space

Observations

Lead_time

Observations

Arrival_Year

Observations

Arrival_month

Observations

Observations

Repeated_guests

Observations

No_of_previous_cancellations

Observations

No_of_previous_bookings_not_cancelled

Average_price_per_room

No_of_special_requests

Observations

Converting the data types

No_of_adults

Observation

No_of_children

Observations

No_of_weekend_nights

Observations

No_of_week_nights

Observations

Type_of_meal_plan

Observation

required_parking_space

Observation

room_type_reserved

Observation

Arrival_year

Observations

Arrival_month

Observations

Arrival_date

Observations

Market Segment Type

Observations

Repeated_guests

Observations

No_of_previous_cancellations

Observations

no_of_special_requests

Observations

Booking Status

Observations

Bivariate analysis

Observations

Average Room Price vs Market Segment

Percentage of repeated guests that cancelled

Observations

No_of_special_request in Booking Status = 'Cancelled'

Observations

Cancellation by Market Sector

Cancellations by Market Segment and Special Request vs Arrival Month

Observations

Observations

Data Preprocessing

Missing Values

Observation

Duplicate value check
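Both checks in this part of the preprocessing (missing values and duplicate rows) can be sketched together with pandas. The toy frame below is hypothetical and deliberately contains one missing value and one duplicated row; it is not the booking data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value and one duplicated row.
df = pd.DataFrame(
    {
        "lead_time": [10, 10, 45, np.nan],
        "market_segment_type": ["Online", "Online", "Offline", "Corporate"],
    }
)

# Count of missing (NaN) values in each column.
missing_per_column = df.isnull().sum()

# Number of rows that are exact copies of an earlier row.
n_duplicates = int(df.duplicated().sum())

print(missing_per_column)
print("duplicate rows:", n_duplicates)
```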

Observation

Outlier detection using box plots

Observation

Treating Outliers
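The notebook text does not say which treatment was applied. One common option, shown here only as an assumption, is to clip values to the box-plot whiskers (Q1 - 1.5*IQR and Q3 + 1.5*IQR). The helper name `clip_to_whiskers` and the sample prices are hypothetical.

```python
import pandas as pd


def clip_to_whiskers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Hypothetical room prices with one extreme value.
prices = pd.Series([60.0, 80.0, 90.0, 100.0, 110.0, 120.0, 540.0])
treated = clip_to_whiskers(prices)

print(treated.tolist())
```

Clipping keeps every row, which matters when the extreme bookings are real observations rather than data-entry errors.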

EDA

Average Price Per Room

Observations

Lead time

Observation

Arrival Month vs No of Special requests vs Booking Status

Observations

Data Preparation

Encoding Not_Canceled as 0 and Canceled as 1
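A minimal sketch of this encoding, assuming the target column is named `booking_status` and holds exactly the two labels named above:

```python
import pandas as pd

# Hypothetical sample of the target column.
data = pd.DataFrame(
    {"booking_status": ["Not_Canceled", "Canceled", "Not_Canceled"]}
)

# Map the labels to integers so that Canceled is the positive class (1).
data["booking_status"] = data["booking_status"].map(
    {"Not_Canceled": 0, "Canceled": 1}
)

print(data["booking_status"].tolist())
```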

Split Data
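The split can be sketched with scikit-learn's `train_test_split`. The 70/30 ratio, the stratification, the random seed, and the synthetic data are all assumptions made for illustration; the notebook text does not state the settings that were used.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Hypothetical features: one numeric and one categorical column.
X = pd.DataFrame(
    {
        "lead_time": rng.integers(0, 400, size=100),
        "market_segment_type": rng.choice(["Online", "Offline"], size=100),
    }
)
# Hypothetical target with roughly one third cancellations (1 = Canceled).
y = pd.Series([0] * 67 + [1] * 33, name="booking_status")

# One-hot encode categorical columns, dropping the first level of each.
X = pd.get_dummies(X, drop_first=True)

# Hold out 30% for testing; stratify keeps the class ratio similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

print(X_train.shape, X_test.shape)
```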

Building a Model

Model evaluation criterion

Model can make wrong predictions as:

  1. Predicting that a booking will be Canceled when in reality it is Not_Canceled (a false positive).
  2. Predicting that a booking will be Not_Canceled when in reality it is Canceled (a false negative).

Which case is more important?

How to reduce this loss, i.e., how to reduce False Negatives?

First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.
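A minimal sketch of such helpers is shown below. The function names `classification_metrics` and `confusion_counts` are hypothetical, and the helpers take predicted probabilities plus a threshold so the same code works for every model and cut-off.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)


def classification_metrics(y_true, y_prob, threshold=0.5):
    """Return accuracy, recall, precision and F1 at the given threshold."""
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    return pd.DataFrame(
        {
            "Accuracy": accuracy_score(y_true, y_pred),
            "Recall": recall_score(y_true, y_pred),
            "Precision": precision_score(y_true, y_pred),
            "F1": f1_score(y_true, y_pred),
        },
        index=[0],
    )


def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return the 2x2 confusion matrix (rows: actual, columns: predicted)."""
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    return confusion_matrix(y_true, y_pred, labels=[0, 1])


# Hypothetical labels (1 = Canceled) and predicted probabilities.
y_true = [0, 0, 1, 1, 1]
y_prob = [0.2, 0.6, 0.4, 0.7, 0.9]

metrics = classification_metrics(y_true, y_prob)
cm = confusion_counts(y_true, y_prob)
print(metrics)
print(cm)
```

In the confusion matrix, the bottom-left cell counts false negatives, the error this notebook aims to reduce.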

Logistic Regression (with Sklearn library)

Checking model performance on training set

Observations

We have built a logistic regression model that shows good performance on the train and test sets, but to identify significant variables we will have to build a logistic regression model using the statsmodels library.

Information on VIF

Checking Multicollinearity
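The variance inflation factor of feature j is VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. Notebooks often use the statsmodels helper for this; the sketch below instead computes it from the definition with NumPy, which is a substitution made for illustration. The function name `vif_table` and the features `x1`, `x2`, `x3` are hypothetical.

```python
import numpy as np
import pandas as pd


def vif_table(X: pd.DataFrame) -> pd.Series:
    """Compute VIF_j = 1 / (1 - R_j^2) for every column of X."""
    vifs = {}
    for col in X.columns:
        target = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        # Regress this column on the remaining ones, with an intercept.
        design = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r_squared = 1.0 - resid.var() / target.var()
        vifs[col] = 1.0 / (1.0 - r_squared)
    return pd.Series(vifs, name="VIF")


rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
features = pd.DataFrame(
    {
        "x1": x1,
        "x2": x1 + 0.1 * rng.normal(size=200),  # nearly a copy of x1
        "x3": rng.normal(size=200),             # unrelated to x1 and x2
    }
)

vif = vif_table(features)
print(vif.round(2))
```

A large VIF for a pair of features signals that they carry overlapping information, and dropping one of them is a typical remedy.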

Removing no_of_previous_cancellations

Observation

Building a Logistic Regression (with statsmodels library)

Note: The above process can also be done manually by picking one variable at a time that has a high p-value, dropping it, and building a model again. But that might be a little tedious and using a loop will be more efficient.

Now no feature has a p-value greater than 0.05, so we'll consider the features in X_train2 as the final ones and lg2 as the final model.

Coefficient interpretations

Converting coefficients to odds
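Logistic regression coefficients are log-odds, so exponentiating a coefficient b gives the odds ratio exp(b), and (exp(b) - 1) * 100 gives the percentage change in odds for a one-unit increase in that feature. The coefficients below are hypothetical round values chosen to make the arithmetic easy to follow; they are not the fitted values from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients on the log-odds scale.
coefs = pd.Series(
    {
        "feature_up": np.log(2.0),    # doubles the odds
        "feature_down": np.log(0.5),  # halves the odds
        "feature_flat": 0.0,          # no effect
    }
)

odds_ratio = np.exp(coefs)
pct_change = (odds_ratio - 1) * 100

summary = pd.DataFrame(
    {"odds_ratio": odds_ratio, "pct_change_in_odds": pct_change}
)
print(summary.round(2))
```

An odds ratio below 1 corresponds to a negative percentage change, which is how a statement such as "a 50% decrease in the odds of cancellation" is read off the table.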

Coefficient interpretations

Factors increasing the chance of a booking being cancelled

Factors decreasing the chance of a booking being cancelled

Similar interpretations can be done for the other attributes.

Model Performance Evaluation

Checking model performance on the training set

ROC-AUC

Model Performance Improvement

Optimal threshold using AUC-ROC curve
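One common way to pick a threshold from the ROC curve, assumed here, is to take the point where the true positive rate minus the false positive rate is largest. The labels and probabilities below are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels (1 = Canceled) and predicted probabilities.
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.30, 0.45,
                   0.60, 0.65, 0.70, 0.80, 0.90])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# Index of the threshold that maximizes TPR - FPR.
best_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[best_idx]

print("optimal threshold:", optimal_threshold)
```

A booking is then flagged as a likely cancellation when its predicted probability is at or above this value.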

Checking model performance on training set

Let's use the Precision-Recall curve and see if we can find a better threshold.
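A threshold can be read off the precision-recall curve in several ways. The sketch below assumes the goal is the point where precision and recall are closest to each other, and it uses hypothetical labels and probabilities.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical labels (1 = Canceled) and predicted probabilities.
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.30, 0.45,
                   0.60, 0.65, 0.70, 0.80, 0.90])

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)

# precision and recall have one more entry than thresholds, so drop the
# final point before comparing them threshold by threshold.
gap = np.abs(precision[:-1] - recall[:-1])
balanced_threshold = thresholds[np.argmin(gap)]

print("balanced threshold:", balanced_threshold)
```

Compared with the ROC-based choice, this criterion usually trades some recall for better precision.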

Checking model performance on training set

Model Performance Summary

Let's check the performance on the test set

Dropping the columns from the test set that were dropped from the training set

Using model with default threshold

Using model with threshold=0.37

Using model with threshold=0.41

Final Model Summary

Conclusion

Building a Decision Tree model

Split Data

Model evaluation criterion

Model can make wrong predictions as:

  1. Predicting that a booking will be Canceled when in reality it is Not_Canceled (a false positive).
  2. Predicting that a booking will be Not_Canceled when in reality it is Canceled (a false negative).

Which case is more important?

How to reduce this loss, i.e., how to reduce False Negatives?

First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.

Build the Tree

Checking model performance on training set

Checking model performance on test set

Visualizing the Decision Tree

Do we need to prune the tree?

The decision tree created is very complex, so we will reduce overfitting via two methods: hyperparameter tuning and cost complexity pruning.

Reducing overfitting

Using GridSearch for Hyperparameter tuning of our tree model
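A minimal sketch of the grid search is shown below. The parameter grid, the recall scoring, the number of folds, and the synthetic data from `make_classification` are all assumptions; the notebook text does not list the grid that was searched.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=300, n_features=6, random_state=1)

# Hypothetical grid of pre-pruning hyperparameters.
param_grid = {
    "max_depth": [2, 4, 6],
    "min_samples_leaf": [1, 5, 10],
}

grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=1),
    param_grid=param_grid,
    scoring="recall",  # recall, because false negatives are the costly error
    cv=5,
)
grid.fit(X, y)

best_tree = grid.best_estimator_  # refit on all of X, y with the best settings
print(grid.best_params_, round(grid.best_score_, 3))
```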

Checking performance on training set

Checking performance on test set

Visualizing the Decision Tree

Observations from the tree

Interpretations from other decision rules can be made similarly.

Cost Complexity Pruning

The DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect of ccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.

Total impurity of leaves vs effective alphas of pruned tree

Minimal cost complexity pruning recursively finds the node with the "weakest link". The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first. To get an idea of what values of ccp_alpha could be appropriate, scikit-learn provides DecisionTreeClassifier.cost_complexity_pruning_path that returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.

Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and the tree depth decrease as alpha increases.
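The pruning steps described above can be sketched as follows, using synthetic data from `make_classification` as a stand-in for the training set. The names `clfs` and `ccp_alphas` follow the text; everything else is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=300, n_features=6, random_state=1)

# Effective alphas and total leaf impurities along the pruning path.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

# Train one tree per effective alpha (guarding against tiny negative values
# that could arise from floating point error).
clfs = [
    DecisionTreeClassifier(random_state=1, ccp_alpha=max(alpha, 0.0)).fit(X, y)
    for alpha in ccp_alphas
]

# The last alpha prunes the whole tree down to a single node.
last_node_count = clfs[-1].tree_.node_count

# Drop that trivial tree before comparing the remaining ones.
clfs, ccp_alphas = clfs[:-1], ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depths = [clf.tree_.max_depth for clf in clfs]

print("nodes in last tree:", last_node_count)
print("node counts:", node_counts)
print("depths:", depths)
```

Scoring each tree in `clfs` on the training and test sets, for example with recall, is then what allows an alpha to be chosen.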

The maximum value of Recall is at an alpha of 0.0001, but the decision tree at that alpha would be too complex. Instead, we can choose an alpha of 0.006 to simplify the tree and still get a high recall.

Checking performance on training set

Checking performance on test set

Visualizing the Decision Tree

Creating a model with 0.006 alpha

Checking performance on the training set

Checking performance on the test set

Visualizing the Decision Tree

Observation

Model Performance Comparison and Conclusions

Observations

Answers to questions

What are the busiest months in the hotel?

Which market segment do most of the guests come from?

Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?

What percentage of bookings are canceled?

Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?

Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?

Observations

Actionable Insights and Recommendations

Decision Tree Summary

Scenario 1 - If the booking lead time is less than 5 months (151 days), the deciding factors for whether a booking is cancelled are

Scenario 2 - If the lead time is less than 5 months and special requests are made, the booking is expected to be kept regardless of the segment it was made from.

Scenario 3 - If the lead time is greater than 5 months, there is a likelihood that the room will be cancelled whether the price of the room is above or below 100 euros.

Regression Model Summary

Based on the regression, some factors decreased the odds of a booking being cancelled: being a repeated guest, making special requests, booking through the Offline or Corporate segments, and requiring a parking space.

Some of these were consistent with the decision tree, which indicated, for example, that if the lead time was less than 5 months and special requests were made, the booking was expected to be kept regardless of the method of booking.

The decision tree also indicated that if a booking is made through any segment other than Online and the lead time is less than 3 months, the booking is expected to be kept. This is consistent with the regression model, which indicated an 83% and a 52% decrease in the odds of a booking being cancelled if the booking was made Offline or Corporate, respectively. Based on the intersection between the results of the decision tree and the logistic regression model, the following policy for cancellations and refunds seems appropriate.

Policy Summary

Cancellations

Cancellation penalty fees will be applied in the following situations:

* If a booking is made from other segments (excluding Online) with a lead time of less than 3 months and there are no special requests.
* If a booking is made online with a lead time of less than 13.5 days and there are no special requests.
* If the lead time is less than 5 months and special requests are made, regardless of whether the booking was made online or not.

Penalty fees do not apply if bookings are cancelled more than 151 days (5 months) in advance.

Refunds